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Abstract 

About 1 million expressed sequence tag (EST) sequences comprising 125.3Mb nucleotides were 
accreted from 51 cDNA libraries constructed from a variety of tissues and organs under a range of condi- 
tions, including abiotic stresses and pathogen challenges in common wheat (Triticum aestivum). Expressed 
sequence tags were assembled with stringent parameters after processing with inbuild scripts, resulting in 
37 138 contigs and 215 199 singlets. In the assembled sequences, 10.6% presented no matches with 
existing sequences in public databases. Functional characterization of wheat unigenes by gene ontology 
annotation, mining transcription factors, full-length cDNA, and miRNA targeting sites were carried out. 
A bioinformatics strategy was developed to discover single-nucleotide polymorphisms (SNPs) within our 
large EST resource and reported the SNPs between and within (homoeologous) cultivars. Digital gene ex- 
pression was performed to find the tissue-specific gene expression, and correspondence analysis was exe- 
cuted to identify common and specific gene expression by selecting four biotic stress-related libraries. The 
assembly and associated information cater a framework for future investigation in functional genomics. 
Key words: wheat; ESTs; annotation; transcription factors; miRNA; SNP; correspondence analysis 



1. Introduction 

Wheat provides 21% of food calories and 20% of 
protein to more than 4.5 billion people worldwide. 1 
Demand for wheat in the developing world is pro- 
jected to increase 60% by 2050. At the same time, 
climate change-induced temperature increases are 
estimated to reduce wheat production by 29%. 2 

The advent of new molecular genetic technology 
and the dramatic increase in plant gene sequence 
data have provided opportunities to underpin wheat 
breeding programmes in order to improve yield, grain 
quality, and disease resistance. 3 Many of these 
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technologies have been designed to facilitate detection 
and understanding of the alterations in gene expres- 
sion that accompany differential development or that 
result from the perception of changes to the environ- 
ment. Expressed sequence tag (EST) projects provide a 
very useful and quick means of accessing gene 
sequence and expression information. When combined 
with breakthroughs in highly parallel designs for gene 
expression analysis, large-scale EST projects now offer 
new perspectives for understanding the molecular 
basis of important traits in plants of agricultural 
relevance. 4 EST sequencing projects have been com- 
pleted or are under way for many plant species. These 
projects have provided useful tools for intragenomic 5 
and intergenomic 6 comparisons, gene discovery, 7-9 
molecular marker identification, 10 microarray 
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development, 11-15 and polyploid species genomic 
resource development. 1 6-1 8 As robot throughput 
increases and cost-per-read drops, determination of a 
sequence tag for a large proportion of genes is now 
reasonable using this random cDNA sequencing 
approach. Forexample, the availability of the complete 
genome sequence of Arabidopsis thaliana revealed that 
the 1 05 000 ESTs available at the end of 2 000 were 
enough to tag 60% of the 25 500 genes. 19 The 
complete genome sequences of several plant species 
are known and the rate at which whole genomes are 
being sequenced is increasing. Correct annotation of 
these genomes remains problematic despite gene 
prediction algorithms becoming ever more sophisti- 
cated. While wheat genome sequencing is rapidly pro- 
gressing (www.wheatgenome.org), a quicker and 
complementary approach to identifying a large 
number of wheat genes is EST and full-length cDNA 
sequencing. These resources will prove invaluable for 
annotatingthe genomes of wheat and other monocots 
and as substrates for transgenic improvement of crops. 
As in Arabidopsis and rice, these tools will prove to be 
critical in speeding up the genetic improvement of 
wheat. 

DNA markers constructed from ESTs are effective 
since they are contained within an exon region of 
genes that are actually expressed. Examination of 
DNA sequence databases permits a direct search for 
sequence polymorphisms and thus molecular 
markers. These polymorphisms are typically single- 
nucleotide polymorphisms (SNPs) or small inser- 
tions-deletions (indels). More importantly, the SNPs 
are identified in EST sequences, thus the polymorph- 
isms can be used to directly map functional, expressed 
genes, rather than DNA sequences derived from 
conventional RAPD and AFLP techniques, which are 
typically not genes. This has led to studies on 
linkage disequilibrium in genes to better characterize 
associations between phenotype and genotype. 20-22 
To identify an SNP from an EST database, the database 
must be composed of ESTs derived from different gen- 
otypes, followed by alignment of the same EST 
sequences from different genotypes. 23 SNP markers 
rely upon the underlying redundancy within EST col- 
lections and assume that distinct genotype of a 
plant genome will be represented within a collection. 

Earlier, we described extensive wheat EST resources 
including full-length sequences, 24-27 and the usage 
of wheat transcriptome analysis by making a custom 
microarray with ESTs. 28,29 Here, we extend our 
efforts and describe a collection of further ESTs and 
the complete functional analysis of the whole EST so 
far developed (~1 million ESTs). Our work is based 
on a set of cDNA libraries established from 51 
different tissues of interest varied from growth 



stages and biotic and abiotic stresses in 10 different 
cultivars. 

2. Materials and methods 

2. 7. Plant materials and cDNA library construction 

Eighteen libraries from various growth stages, 2 5 
libraries from abiotic stresses (cold, drought, saline, 
and mineral toxic), and eight libraries from biotic 
stresses (leaf rust, powdery mildew, and blast) were 
constructed from eight different wheat lines 
(Table 1). Out of 51 libraries, 20 libraries were 
newly constructed and included for this study. 
Double-stranded cDNAs were synthesized as previ- 
ously described. 24 cDNAs were ligated with 
pBlueScript SK(+) digested with EcoRI and Xhol After 
transformation by electroporation, transformed 
bacterial cells were initially cultured in the SOC 
medium for 1 h before culture at 3 7°C for 2 h in 
2x LB medium. Cultured cells were stored in 20% 
glycerol at -80°C until use. Transformed bacteria 
were randomly selected and plasmid DNA was 
extracted. 24 Inserted cDNAs were sequenced from 
both ends using dye terminator cycle sequencing 
(Applied Biosystems, Foster City, CA, USA). 

2.2. EST processing and assembly 

The chromatogram files were base called and 
quality trimmed using PHRED 30 with default para- 
meters. Vector, library linker-primer, and EcoRI 
adapter sequences were removed using 
CROSSMATCH. Repeat, ambiguous sequences (PHRED 
quality values <30) and poly (A) tails or poly (T) 
sequences (at most 10 bases) in the ESTs were 
trimmed. Subsequently, ESTs with sequences 
<30 bp were omitted from the final data set. The 
remaining high-quality sequence was used for 
further study. All sequence data are available from 
the DNA database of Japan (Table 1). The processed 
EST sequence files were combined and assembled 
into contigs using the CAP3 program 31 with a high 
and low stringency level (high 95% homology in a 
20 bp overlap; low 80% homology in a 1 5 bp 
overlap). Default CAP3 settings include -p 90 -h 20; 
the custom parameter settings used were -p 85 -h 
90. The CAP3 -p option specifies overlap per cent 
identity cut-off, while the -h option specifies the 
maximum alignment overhang percentage. 

2.2.1. Sequence annotation Using the BLAST 
program (BLASTX with a search threshold of 1e-5), 
the sequences of the contigs were searched against 
seven databases (NCBI's nr; http://www.ncbi.nlm.nih. 
gov/genbank, Uniprot; http://www.uniprot.org, RAP- 
DB; http://rapdb.dna.affrc.go.jp, RGAP; http://rice. 
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Libra ry 
name 


Genotype 


Stage 


Condition 




No. of 
EST 


Accession number 


whcs 


L.b 


Callus 


CiS 




11 cnc 
I I dUd 


L-Jb 1 o z U b — t-J b L b4bU, 
CJ627048-CJ632007 


whr 


CS 


Root 


GS 




1 9 227 


BJ2771 29-BJ287630 


whs 


L.b 


Seedling 


Lib 




1 3 Q C C 

I 3 3 bo 


HAUUUUU I — HAU I UUU4 


whdl 


CS 


Seedling crown 


GS 




1 2 761 


BJ221 844-BJ231 91 2 


whh 


CS 


Spike at heading 


GS 




20 648 


BJ255495-BJ266779 


whf 


CS 


Spike at flowering 


GS 




21 106 


BJ2431 95-BJ255494 


whoh 


CS 


Pistil at heading 


GS 




20 736 


BJ266780-BJ2771 28 


whpc 


CS 


Anther at meiosis 


GS 




11016 


CJ5761 97-CJ580898, 
CJ682880-CJ687382 


whhg 


CSMT4B 3 


Anther at meiosis 


GS 




9669 


CJ549536-CJ5541 32, 
CJ657247-CJ661 657 


whsh 


CS 


Young spikelet 


GS 




1 1 302 


CJ730709-CJ736986 


whyd 


CS 


Spikelet at late 
flowering 


GS 




1 4 708 


BJ300204-BJ31 2233 


whok 


CS 


DPA5 


GS 




1 2 1 59 


CJ570869-CJ5761 96, 
CJ677747-CJ682879 


whms 


CSDT3DL b 


DPA5 


GS 




1 2 036 


CJ565425-CJ570868, 
CJ6/24b2 — CJb77/4o 


whe 


CS 


DPA1 0 


GS 




1 9 200 


BJ231 91 3-BJ2431 94 


whdp 


CS 


DPA20 


GS 




1 3 455 


CJ523461 -CJ5291 79, 
CJ632008-CJ637596 


whsl 


CS 


DPA30 


GS 




1 5 522 


BJ287631 -BJ300203 


whsp 


CS 


Seedling 


GS 




1 2 783 


HX247045-HX247474 


whca 


CS(Sp5A) c 


Seedling 


GS 




794 


HX247475-HX257200 


wh kp 


CSKmppd d 


Seedling 


Grown under continuous 1 


ght 


1 2 950 


CI5541 33-CI 559953 
CJ661 658-CJ6671 90 


w h kv 

V V 1 1 1 VV 


CS 


CppH [ i no 


Grown under continuous 1 
treatment 


ght after cold 


1 5 360 


fT559954-CI5fi5424 
CJ6671 91 -CJ672461 


whpm 

VV 1 1 1 1 1 


KitakeM 3 54 

1\ I Lu J\t 1 1 J Jt 


Dormant sppH 

1 ' V J I J1IOIIL l 1 ' I„ U 


With water supply 




11671 


CI539482-C1544724 
CJ647722-CJ652633 


whei 


KitakeM 3 54 

1\ I Lu J\t 1 1 J J ^ 


Dormant sppH 

1 ' V J I J1IOIIL l 1 ' I„ U 


With water supply after wounded 


1 1 743 


CI534326-C1539481 
CJ642661 -CJ647721 


whsc 


Kitakeil 3 54 


Shoot 


Cold treatment after excision of grain 
part 


13 079 


CI58fi310-Cl591 77fi 
CJ692607-CJ697797 


whsd 


Kitakeil 354 


Shoot 


Dehydration 




1 1 897 


CJ591 777-CJ596845, 
CJ697798-CJ70261 5 


wh rd 


Kitakeil 3 54 

1\ i Lu J\t 1 1 J Jt 


Root 


Dehydration 




12 436 


ri580899-Cl586309 
CJ687383-CJ692606 


wh v3 




Shoot 


3 days cold condition 




1 0 069 


CI601 934-C1606680 
CJ707296-CJ71 1 731 


wh v 




Shoot 


1 6 days cold condition 




1 1 087 


CI59fi846-Clfi01933 
CJ70261 6-CJ707295 


whva 


Valuevskaya 


Shoot 


ABA treatment 




1 0 631 


CJ606681 -CJ61 1 586, 
CJ71 1 732-CJ71 5867 


whvd 


Valuevskaya 


Shoot 


Five days dehydration 




1 1 767 


CJ61 1 587-CJ61 6926, 
CJ71 5868-CJ720922 


whvh 


Valuevskaya 


Shoot 


Heat shock treatment 




10090 


CJ616927-CJ621 531, 
CJ720923-CJ72541 9 


whvs 


Valuevskaya 


Liquid cultured cells 


Liquid cultured cells 




1 2 327 


CJ621 532-CJ627047, 
CJ725420-CJ730708 



Continued 
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Library 
name 


Genotype 


Stage 


Condition 




No. of 
EST 


Accession number 


whrs6 e 


CS 


Root 


Salt stress for 6 h 




13 110 


HX01 0005-HX01 9847 


whss6 e 


cs 


Leaf 


Salt stress for 6 h 




13 312 


HX01 9848-HX0301 80 


whrs24 e 


CS 


Root 


Salt stress for 24 h 




1 2 949 


HX0301 81 -HX040252 


whss24 e 


cs 


Leaf 


Salt stress for 24 h 




1 2 487 


HX040253-HX050054 


whatl 6 


Atlas66 


Root 


No treatment 




20 51 9 


CJ773323-CJ797201 


whatlal 6 


Atlas66 


Root 


50 mM Al for 6 h 




28 795 


CJ82281 8-CJ848636 


whsct e 


Scout66 


Root 


No treatment 




27 71 7 


CJ797202-CJ82281 7 


whsctaF 


Scout66 


Root 


50 mM Al for 6 h 




29 824 


CJ848637-CJ872807 


whhb e 


Halberd 


Root 


No treatment 




22 488 


HX124648-HX143755 


whhbb e 


Halberd 


Root 


1 OmM boric acid for 24 h 




22 522 


HX143756-HX163093 


whcr e 


Cranbrook 


Root 


No treatment 




22 680 


HX1 63094-HX1 82808 


whcrb 6 


Cranbrook 


Root 


1 OmM boric acid for 24 h 




22 61 6 


HX1 82809-HX201 765 


wntnis 


Thatcher 


Seedling 


Infected with leaf rust 




30 307 


CJo/ zoUo — cjoy64yu 


whthkles e 


NILThatcher 


Seedling 


Infected with leaf rust 




24 701 


CJ896491 -CJ91 9993 


whchan 6 


Chancellor 


Seedling 


Infected with powdry mildew 




29 281 


CJ91 9994-CJ9441 55 


whchu 6 


NILChancellor 


Seedling 


Infected with powdry mildew 




28 799 


CJ9441 56-CJ9681 75 


whnr e 


Norin4 


Seedling 


No treatment 




3441 5 


HX050055-HX071 91 8 


whnrpr48 e 


Norin4 


Seedling 


Infected with blast strain Pr48 
for 4 days 


at 23X 


23 987 


HX071 91 9-HX08471 6 


whnrpr58r e 


Norin4 


Seedling 


Infected with blast strain Pr58 
for 4 days 


at 23 C 


36 360 


HX08471 7-HX1 06894 


whnrpr58s e 


Norin4 


Seedling 


Infected with blast strain Pr58 
for 4 days 

Total 


at 27 C 


30 797 
894 756 


HX1 06895-HX1 24647 



CS, Chinese Spring; GS, growth stage; DPA, days to post-anthesis; NIL, near-isogenic lines. 

a Mono-telosomic 4BS of CS. 

b Ditelosomic 4BS of CS. 

c Spelta5A chromosome substituted in CS. 

d Near-isogenic line. 

e Newly constructed libraries. 



plantbiology.msu.edu, Tair9; http://www.arabidopsis. 
org, MaizeGDB; http://www.maizegdb.org, and 
Brachypodium database; http://db.brachypodium. 
org). According to Ewing et al., 7 only contigs were 
taken for further analyses. The gene ontology (GO) 
terms 32 of each contig was derived by 
InterProScan. 33 The GO terms were then converted 
into GO slim term using EBI website (http://www. 
ebi.ac.uk/QuickGO/) by written perl-script for this 
purpose. Open reading frames (ORFs) were searched 
by translating sequences into amino acids by six 
frames (three per frame in the plus and minus 
strands). 



used to mine our data by local BLAST. The default 
parameters mentioned in the database was used for 
prediction (Filter 'on', gapped alignment 'on', substitu- 
tion matrix 'blosum 62', E-value <1e-10). We 
included two meta-rules in our classification 
scheme: (i) if a protein harbours domains character- 
istic of a transcription factor (TF) family and a tran- 
scriptional regulators (TR) family, we assigned it to 
the TF family, (ii) when the protein of interest con- 
tains domains characteristic of more than one TF 
family or more than one TR family, it was assigned 
to the family to which its characteristic domains 
matched with the lowest £-value. 



2.3. Transcription factor 

The PlnTFDB 34 containing 29 473 sequences of 
plant genes involved in transcriptional control was 



2.4. Full-length cDNA 

The contigs were classified as full length if it 
aligned with our full-length cDNA data. 27 BLASTN 
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searches of the contigs against the full-length 
sequences yielded a candidate hit list (E-value 
<1e-100) of putative full-length sequences that 
either covered the start and stop codon of the 
subject sequence or possessed sufficient sequence 
up/down-stream of the match to contain putative 
start and stop signals. In a few instances, some 
contigs covered all but the start methionine, and 
were also included as full-length sequences. In 
addition, contigs were aligned (E-value <1e-5, 
>9 8% query coverage, >9 8% identity) with barley 
full-length sequences 35 to know the similarity as 
well as the full-length nature. 

2.5. miRNA analysis 

To identify conserved miRNA in wheat, contigs were 
annotated with the plant small RNA regulator 
target analysis database (http://plantgrn.noble.org/ 
psRNATarget/) containing small RNA of 1 5 plant 
species including wheat. 36 This database contained 
2192 published miRNA sequences, including 32 
from Triticum aestivum, 1 48 from sorghum, 496 from 
Oryza sativa, 31 9 from maize, and 2 24 from A. thali- 
ana. Potential targets were predicted according to the 
rules applied by 37,38 : (i) the number of allowed mis- 
matches at complementary sites between miRNA 
sequences and potential mRNA targets is four or 
fewer; and (ii) no gaps are allowed at the complemen- 
tary sites. 

2.6. SNP discovery 

Sequence variants or SNPs were mined in wheat 
contigs with two criteria, and perl-scripts were 
written for each category. In the first criterion, only 
contigs with >4 ESTs were selected, and SNPs were 
declared only when there was no mismatch, no 
gaps, or N's were admitted before and after an SNP 
site; in addition, the alternative base to the consensus 
sequence was present at least more than twice in an 
alignment. In the second criterion, the SNPs were 
mined only in the significant sequence of contigs 
that was worked out by counting the nucleotides of 
either end of the contigs containing a minimum of 
four EST members. To find the SNP between cultivars, 
in addition to the above parameters, the contigs con- 
taining the minimum of two consistent EST from the 
same cultivar were selected. For homoeologous SNPs, 
the contigs containing the minimum of four EST from 
the same cultivar were chosen. The visual inspection 
of SNP was carried out using Tablet software. 39 

2.7 . Digital gene expression and correspondence 
analysis 

For statistical analysis of gene expression profiles, 
contigs harbouring five or more constituents were 



selected from 37 1 38 contigs. Similarities between 
contigs or libraries were estimated using Pearson's 
correlation coefficient. 40 Hierarchical clustering was 
applied to compare EST expression profiles among the 
51 wheat tissue/treatments and libraries. Expression 
profiles are displayed based on the numberof constitu- 
ents in a contig (from 0 to 4647; red intensity), along 
with an increasing number of constituents. Contigs 
specific to DREB (dehydration-responsive element 
binding), NAC (nitrogen assimilation control), OMT 
(O-methyl transferase), and miRNA 1 72 were selected 
toshowthedifferential gene expression in cultivarsand 
growth stage. 

Correspondence analysis (CA) was carried out by 
selecting four disease-related libraries (whthls, 
whthkles, whchan, and whchul; Table 1) as per the 
procedure detailed in Hamada et al. 4 ^ and visualized 
by a custom-build viewer (available based on request). 

3. Results 

3.1 . cDNA library construction and assembly 

Our previous studies carried out the construction 
and analysis of 31 libraries. 26 Here, we have reported 
the further addition of 2 0 libraries and their com- 
bined analyses for comprehensive view of wheat EST. 
At the maximum, we have accumulated ~1 million 
ESTs (Table 1). The libraries were generated from 
developmental stages and stresses. After trimming 
low-quality bases, vector sequences, and shortness 
(<30 bp), 0.68 million ESTs were used for CAP3 
assembly under stringent conditions resulting in 37 
138 contigs and 215 199 singlets. When assembled 
using relaxed settings in CAP3, 65 426 contigs and 
66 875 singlets were obtained. The high stringent 
condition was chosen to achieve a more complete 
isolation of individual paralogues, orthologues, and 
homoeologues compared with using a low stringent 
condition. 

The total sequence of transcript assemblies in strin- 
gent parameter settings, containing both the singlets 
and contigs, developed in this study was 125.3 Mb 
with theGC% of 51 .9%. This is the maximum transcrip- 
tome sequences developed in any plant species. The 
GC% value is similar to rice but less than that reported 
in the wheat 3 B chromosome exon coding sequence. 42 
The length of the singlets varied from 31 to 884 bp 
with an average of 430 bp. The maximum singlets 
were grouped under 500-600 bp lengths (Fig. 1 ). As 
we found a large number of singlets, we subsequently 
discriminated the singlet contribution among the 1 0 
cultivars (Fig. 1 B). There was no correlation observed 
between the number of ESTs and the singlets. 
However, the stress-related libraries from four cultivars 
contributed to 55% of the total singlets. The contig 
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Sequence length (bp) 




ll.l.l 



AT SC CC TC CR HB CS KT NR W 
Cultivar 



Figure 1. Analysis of singlet sequence length and their genotype-wise distribution. (A) Sequence length distribution of singlet. (B) 
Genotype-wise frequency (%) of singlet (AT, Atlas; SC, Scout; CC, Chancellor; TC, Thatcher; CR, Cranbrook; HB, Halberd; CS, Chinese 
Spring; KT, Kitakei! 354; NR, Norin4; VV, Valuevskaya). 




length ranged from 46 to 3960 bp with an average of 
879 bp, and ~70% of the contigs extended from 501 
to 1 000 bp (Fig. 2A). The number of ESTs grouped in 
each contig varied between 2 and 4647, with 78% of 
the contigs containing 2-10 EST members (Fig. 2B). 

3.2. Functional annotation 

The contig resulted from stringent parameter 
assembly was used for functional annotation. The 
function of each contig was derived after annotation 
with rice, Arabidopsis, maize, and Brachypodium data- 
bases, in addition to the protein sequences in the 
GenBank nr database (BLASTX; £-value <1e-5). 
The recently sequenced Brachypodium was included 
for its close originated relation with wheat. On anno- 
tation, maximum similarity was observed in 
Brachypodium followed by rice (Table 2). In rice, 
further annotation was carried out to find the 
chromosome-wise sequence similarity and identified 
that chromosome 1 is having much co-linearity fol- 
lowed by chromosome 3 (Fig. 3). On overall annota- 
tion, ~3 500 genes were found to have no similarity, 
suggesting new genes in our data. To further validate 



these new genes, updated tentative consensus 
sequences from the DFCI wheat gene index (http:// 
compbio.dfci. harvard.edu/cgi-bin/tgi/gimain. pl?gudb= 
wheat) were annotated and resulted in the same 
number of new genes (~ 3 5 00), confirming the import- 
ance of our new EST assembly and analysis in wheat. To 
estimate the total number of full-length cDNAs in our 
collection, we searched our contig data against our 1 1 
902 full-length cDNA data. With the stringent criteria 
of >95% similarity and the expected cut-off value of 
<1e-100, we found ~7000 contigs that were full 
length in nature, further validated by identification of 
high similarity of wheat contigs with barley full-length 
sequences, indicating the robustness of our data and 
their applications to wheat functional genomics. 

We also analysed length and ORF distribution in 
the contigs for both plus and minus strands. To 
obtain meaningful results, only contigs with >5 ESTs 
were selected and ORFs were identified. GO annota- 
tion of the wheat contigs was performed on the 
basis of ORF mining of the data. The GO terms were 
organized into three categories representing 
molecular functions, biological processes, and cellular 
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Table 2. Annotation of wheat contigs 



Database and 


URL 


No. and percentage 


Species 




of similarity 


KAr-DB (Duild 5 ) 


http://rapdb.dna.affrc. 


32.405 (8/.zb%) 


(Ricel 


0O iD 




kuAI j (Kicej 


http://rice. 






plantbiology.msu.edu 




TAIR9 


n1"+ n* / A An Art A/ 
I 1 LL|J. / / WWW. 


jU.jUt \_ O Z. . I *+ /oj 


(Arabidopsis) 


arabidopsis.org 




MaizeGDB 


http://www.maizegdb. 


31 .206 (84.03%) 




org/ 




Brachypodium 


http://db. 


32.522 (87.57%) 


Database 


brachypodium.org 




All databases 




33.909 (91.31%) 


Total contigs 




37.1 38 



The updated (until May 201 1) sequence was retrieved and 
similarity search was carried out. 



6000 - 



5000 




Figure 3. Sequence similarity of wheat contigs with rice genome. 
Based on the result, the contig was grouped in rice 
chromosome wise. 

components. 32 The sum of the wheat contigs per 
category did not add up to 1 00% as some contigs 
were classified into more than one category. Of the 
total contig set, 21 1 25 (56%) were annotated into 
the molecular function category (describing the 
biochemical activity performed by the gene 
product), 13 354 (36%) into the biological process 
GO category (describing the ordered assembly of 
more than one molecular function), and 1 3 356 
(36%) into the cellular component GO category 
(describing subcellular compartments of a cell) 
(Fig. 4). Among the molecular function, the most 
highly represented categories were binding, catalytic 
activity, redox activity, and structural activity 
(Fig. 4A). Among the biological processes, the largest 
proportion of functionally assigned contigs fell into 
metabolic, transport, and translation processes, 
while redox activity, biosynthetic process, regulation, 
phosphorylation, and transcription comprised 34% 
of the contigs (Fig. 4B). For the cell component 
category, almost all contig sequences were annotated 
into the cell-cell subcategory, 28% into the 



membrane category, and 2 3% into the intracellular 
category (Fig. 4C). Together, all three GO categories 
accounted for ~82% of the assigned wheat contig set. 

The role and importance of TF lead to mining of 
our data and resulted in 1183 contigs containing 
either single or multiple transcription factors. 
Among the TFs, the CCAAT family was found in as 
many as 69 contigs. miRNA target sequence analysis 
of the wheat transcriptome identified different 
miRNA target sequences in 5180 contigs which 
ranged from 1 9 to 24 nt long. The majority of the 
small RNAs are 20-24 nt long, which is a typical 
range for dicer-derived products; the 21 -nt class is 
predominant. Among species-specific miRNA, rice 
had maximum homology followed by maize and 
Medicago truncatula (Fig. 5). The number of hits for 
each species is roughly proportional to the number 
of sequences for that species in the database. Due to 
the limited number of wheat miRNA sequences in 
the database, there was only 200 contigs with 
wheat-specific miRNA target sequences. Among 
miRNAs, miRNA 395, 1 72, and 1 64 target sequences 
alone were found in 831 contigs, showing the relative 
abundance of these miRNA target sequences in 
wheat. 



3.3. Sequence polymorphism /SNP mining 

The SNPs were mined in our large-scale wheat 
transcriptome data by applying both relax and strin- 
gent criteria. For both criteria, to find reliable SNPs, 
the common conditions of SNP should be present at 
a given position when there is no mismatch present 
2 bp before or after the SNP site. A total of 51 067 
SNPs were detected from 20 609 contigs using the 
first criterion, resulting in the identification of an 
SNP site every 96 bp. This value is considerably 
higher than those reported for earlier studies of 
wheat. 23,43 Hence, the second criterion was applied 
with an interest to differentiate the homoeologous 
SNPs from intergenome SNP by calculating the signifi- 
cant sequence length of each contig before mining 
the SNPs. This approach avoided finding SNPs on 
either end of the contigs and resulted in the identifi- 
cation of only 6352 SNPs in the wheat contigs. 
Further classification of the SNPs present between cul- 
tivars found 3515 SNPs with a frequency of one SNP 
per 614 bp. As there were genome constituents of 
wheat (three homologous genomes) and selective 
gene expression among the three genomes, we exam- 
ined the SNP within each cultivar and found 2837 
SNPs with an SNP site every 470 bp. The overall SNP 
frequency based on the stringent criteria was one 
SNP per 483 bp. Transitions (70%) were more 
frequent than transversions (30%). As expected, a 
significant positive correlation (P<0.05) was 
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Figure 4. Functional classification of contig sequences based on GO categorization. Sequences were evaluated for their predicted 
involvement in molecular function, biological process, and cellular component. 
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Figure 5. miRNA target sequence analysis in wheat contigs. The 
database bars indicate the available miRNA in the database 
and the hit bars indicate the number of wheat genes having 
miRNA target sequence, tae, Triticum; sbi, Sorghum bicolor, osa, 
Oryza sativa; zma, Zea mays; ath, Arabidopsis thaliana; mtr, 
Medicago truncatula; ghr, Gossibium hirsutum; ptc, Popuius 
trichocorpa; bna, Brassica napus; gma, Glycine max; pta, Pinus 
taeda; sly, Solatium lycopersicum; bra, Brassica rapa; bol, Brassica 
oleraceae; ere, Cblamydomonas reinhardtii. 

observed between the number of SNPs detected in a 
contig and the number of reads present in that 
contig. The cultivars, except Chinese spring and 
Norin4, contain more inter-cultivar sequence vari- 
ation than homoeologous SNPs. Among cultivars, 
Halberd, Valuevskaya, and Cranbrook contain almost 
the same number of SNPs in both cases. Cultivars 



Atlas, Kitakei 1 354, Scout, and Thatcher had much 
less homoeologous sequence variation than cultivar 
differences (Supplementary Fig. S1). 

3.4. Digital gene expression 

EST frequencies approximate the message abun- 
dance in the mRNA population used to construct a 
cDNA library. We have already attempted to make 
tissue expression maps of a large number of ESTs 
from stress-related libraries for in silico screening of 
stress responsive genes in wheat. 26 Here, we aimed 
to determine the global gene expression of wheat 
from 51 cDNA libraries, including growth stages and 
biotic and abiotic stresses. Contigs containing >5 
ESTs were subjected to a correlated clustering 
analysis 7 to compare expression profiles in the 
different libraries. When the result was displayed in 
the form of a dendrogram (Fig. 6), many libraries 
with similar origins agglomerated together. All four 
libraries derived from root tissues treated with boric 
acid and aluminium united. In the same manner, 
biotic stress (leaf rust, powdery mildew, and blast)- 
related tissues were grouped in the same clade. The 
tissues collected from cv. Valuevskaya, which mainly 
undergo abiotic stress, were clustered separately. 

To further determine the tissue-specific gene 
expression of select genes, digital gene expression 
was carried out for DREB and NAC TFs, OMT gene, 
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Figure 6. Correlated clustering of wheat cDNA libraries based on 
gene expression (AT, Atlas; SC, Scout; CC, Chancellor; TC, 
Thatcher; CR, Cranbrook; HB, Halberd; CS, Chinese Spring; KT, 
Kitakei 1 354; NR, Norin4; VV, Valuevskaya). 



and miRNA 1 72 targeting site genes in wheat tran- 
scripts (Supplementary Fig. S2). The DREB genes are 
expressed mainly in dehydration-related tissue 
libraries and the spatial expression was mostly at 
root tissues. In the case of NAC TF genes, as expected, 
the expression was only noted in salt-treated libraries 
and the expression level was greater in root tissue fol- 
lowed by shoot, spikelet, and seed. Interestingly, some 
biotic stress-related libraries also expressed NAC TF. 
Based on the preliminary result obtained from our 
previous study on OMT against self-defence in 
wheat, 29 the OMT-related contigs were mined and 
its gene expression was analysed. We found ubiquitous 
expression of OMT genes irrespective of stress, sug- 
gesting its defence role against stress. While mining 
of miRNA in wheat transcriptome, we found an abun- 
dance of miRNA 1 72 target sites. The digital gene 
expression analysis showed its abundance in all 
tissues and under all treatments in wheat. 



In addition to digital expression, a new method of 
gene discovery and/or gene expression based on CA 
was carried out to determine specific and common 
gene expression between libraries or treatments. 
To identify the common molecular plant-athogen 
interactions, four libraries constructed for leaf rust 
and powdery mildew diseases were selected and 
examined by CA. Contigs with more than or equal to 
four ESTs were selected, to identify specific genes for 
leaf rust and powdery mildew diseases, in addition 
to common disease resistance- and susceptibility- 
related genes (Supplementary Table S1). The 
number of genes expressed for powdery mildew out- 
played the leaf rust disease. However, the number of 
disease susceptibility-related genes was more in leaf 
rust, suggesting different disease reaction mechan- 
isms among diseases in wheat. Overall around 1 00 
new genes were identified from these four disease- 
related libraries, which could have immense value 
for future research of molecular plant-pathogen 
interactions. 



4. Discussion 

Global wheat transcriptome analysis was carried 
out by accumulating ~1 million ESTs from 51 cDNA 
libraries. In comparison with other studies of wheat 
which were biased towards one stress or growth 
stage of a few cultivars, 3,26 here we used all growth 
stages, and biotic and abiotic stresses for 1 0 different 
cultivars. The work flow of EST assembly and further 
analyses were summarized in Fig. 7. Many perl-script 
programs were written specifically for processing the 
ESTs, resulting in a 24% reduction in total ESTs. The 
stringent parameter in EST assembly resulted in 
more singlets compared with contigs which helped 
for further analysis, i.e. SNP mining. 23 The average 
length of the contigs (879 bp) is higher than other 
studies, 3,23,44 and ~80% of the contigs containing 
2-10 EST members had homoeologous or paralo- 
gous genes. To determine the EST member contribu- 
tion to the contig, we further classified the data into 
stress- and growth stage-related parameters 
(Supplementary Fig. S3). Among the 20 stress- 
related libraries, biotic stress-related libraries contrib- 
uted more ESTs to the contigs than libraries for abiotic 
stress (Supplementary Fig. S3A). Among the growth 
stages, libraries of the spikelet at late flowering con- 
tributed more than those of other stages, suggesting 
differential gene expression among the cultivars with 
or without any stress (Supplementary Fig. S3B). The 
high number of unigenes and GC% also suggests the 
possibility of more genes present in wheat than in 
other crops. 45 This is supported by the recent study 
of the megabase level sequencing in 3B, which 
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Figure 7. Schematic diagram explaining the comprehensive EST analysis. The software used in the respective step was mentioned in 
parallel. 



estimated 50 000 genes per diploid genome as a 
result of the additional non-collinear genes inter- 
spersed within the highly conserved ancestral grass 
gene backbone, suggesting accelerated evolution in 
the Triticeae lineages. 44 The presence of additional 
genes was further confirmed by identification of 
new genes based on functional annotation (Fig. 3). 
When putative wheat gene sequences were analysed 
for ORF length based on their hit status, we observed 
significantly shorter ORFs in sequences with no hits. 
These results suggest that ORF length, not sequence 
length, is a better indicator of finding transcripts 
with protein coding capacity and subsequently 
getting a hit in a sequence database. On the other 
hand, more than one-third of the sequences without 
a hit still contained an ORF >300 bp, suggesting 
that sequences without a hit but with a relatively 
long ORF may represent new genes with protein 
coding capacity. We confirmed this by finding more 
full-length cDNA sequences. Our results also showed 
a higher no hit percentage in singlet sequences, 
most likely due to the fact that singletons represent 
rare genes in the wheat genome that are not well 
described in other organisms. 

4. 1 . Functional characterization 

GO analysis revealed expected categories such as 
molecular, biological, and cellular processes (Fig. 5). 
In wheat, the major molecular processes were 
binding and catalytic activities, similarly found in 
other Poaceae species. 44,46,47 Metabolic, transport, 
and translation functions accounted for 50% of the 
biological processes. Among cellular processes, as 



expected, membrane, intracellular, and ribosome 
functions played a major role. Using data from the 
plant TF database, 34 we found 1183 contigs from 
wheat that had a high similarity with 2 1 97 different 
coding sequences of TF from seven species; the high 
similarity percentage was found with rice. The 
number of TF found in wheat contigs was higher 
than reported in Salvia sclorea calyx which was 
sequenced using 454 pyrosequncing. 48 The most 
represented TF family in wheat was CCAAT. The 
CCAAT box is a common c/s-acting element found in 
the promoter and enhancer regions of a large 
number of genes in higher eukaryotes (for 
review). 49,50 In addition to this TF family, several 
other TF families known to be involved in plant devel- 
opment were also present in our data. 

The role of miRNAs in developmental and stress 
regulation is not well established in wheat, and 
increasingly tissue-specific and developmental regula- 
tion of miRNAs is being found mostly in animal 
species. 51 Through cDNA sequencing efforts, we 
have identified transcripts that encode 945 different 
miRNAs, 52 although additional wheat-specific 
miRNAs may still remain in the cDNA collection. 
Indeed, a large number of ncRNAs have the potential 
to form miRNA-like stem loop precursors (data not 
shown), but experimental validation of these poten- 
tial miRNA is required. In our study, we found an 
abundance of miRNA 1 72 target sites in as many as 
236 contigs, and their uniform expression irrespective 
of growth stage and/or stress was confirmed by in 
silico gene expression analysis (Supplementary Fig. 
S2D). 
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The digital gene expression pattern of genes 
related to biotic and abiotic stresses (OMT, DREB, 
and NAC TFs), and epigenetic gene silencing has 
helped us to determine their quantitative and qualita- 
tive gene expression patterns. 7 This approach permits 
both the association of tissues via their common pat- 
terns of gene expression and the association of genes 
via their tissue-dependent expression patterns. The 
correlation clustering of 51 libraries formed the 
groups of the libraries based on respective treatments 
or stages which confirmed the importance of libraries 
as well gene expression specific to treatment. 26 

CA is an explorative computational method for the 
study of associations between variables. Much like 
principle component analysis, it displays a low dimen- 
sional projection of the data, e.g. into the plane with 
three-dimensional view (Supplementary Fig. S4), 
which can be achieved for two to three variables 
simultaneously, thus revealing associations between 
them. Traditionally, CA has been used prevalently in 
categorical data in the social sciences, but its applica- 
tion has been extended also to physical quantities and 
to proteomics. 53 This method allows us to quickly 
analyse the set of EST libraries and to discover 
molecular pathogenicity on wheat. In four disease- 
related libraries, ribulose 1 -5 biphosphate and S-ade- 
nosyl methionine genes were commonly found. In leaf 
rust, UDP-glucosyl transferase and chlorophyll a-b 
binding protein were highly expressed, 29 while in 
powdery mildew treatment, ADP-ribosylation factor, 
lipid transfer protein, and oxalate oxidase were specif- 
ically expressed. The advantage of CA analysis helped 
to find the common molecular resistance mechanism 
in wheat. 54 While comparing susceptible and resist- 
ance reaction mechanisms, some of the genes, such 
as 40S, 60S ribosomal protein and Zinc finger 
domain-containing protein genes, have copy 
number variations between the two mechanisms. 
We have shown that the application of CA to EST 
data provides an informative and concise means of 
visualizing these data, being capable of uncovering 
relationships both among either gene and between 
genes, in particular or common stages. 

4.2. SNP mining 

Assembly of EST sequences into contigs in a poly- 
ploidy species like hexaploid wheat results in each 
contig being composed of ESTs from homoeologous 
loci and members of gene families. 23 SNPs are the 
most abundantly found co-dominant polymorphic 
sites in greater proportion both in intronic and in 
exonic regions of the genome. They occur with vari- 
able frequencies and have become very popular in 
plant genetics and breeding due to their amenability 
for high throughput genotyping. In continuation of 



our earlier study to discriminate homoeologous 
gene expression of hexaploid wheat by SNP analysis 
of contigs grouped from 10 cDNA libraries from 
Chinese Spring, 43 we have mined our large-scale 
data to find SNPs from 1 0 different cultivars with 
various stress treatments. With relaxed criteria, we 
estimated ~50 000 SNPs in wheat with an SNP 
frequency of one SNP per 96 bp— a number that is 
higher than our previous report. 43 This high number 
might be due to the EST originating from 1 0 different 
cultivars with various stress conditions, although the 
possibilities of over-estimation from the end 
sequences could not be excluded. Hence, a new 
criterion was applied by calculating the significant 
sequence length to avoid the end sequence and 
sequencing error-based SNPs. This approach 
accounted for only 20% of the SNPs obtained by the 
initial approach, and could lead to an underestima- 
tion of nucleotide diversity, although it guarantees 
the elimination of false positives. The stringent 
parameter resulted in the SNP frequency of one SNP 
in every 483 bp. In comparison of SNPs among huge 
genome size species, coffee has one SNP every 
222 bp; 55 sugarcane has one SNP every 290 bp; 16 
cotton has one SNP every 500 bp; 17 oak has a 
frequency similar to cotton with one SNP every 
471 bp. 18 Our result in wheat compared with other 
species could explain the low polymorphism found 
in polyploidy species. The comparison of SNPs found 
between and within cultivars showed higher SNP 
frequency between cultivars (Supplementary Fig. 
S1), confirming that the low level of polymorphism 
identified between homoeologous genomes com- 
pared with inter-genome differences could be useful 
to select parents for linkage mapping studies. 



5. Conclusion 

The global wheat EST assembly presented here 
provides an unprecedented look at the wheat tran- 
scriptome and contributes tools for wheat genetics 
and genomics effort. The development and inclusion 
of cDNA libraries from all growth stages, various 
tissues, and treatments portray the complete picture 
of wheat transcriptome. Functional annotation and 
characterization give new ideas about wheat 
expressed genome, at least in part. The identified 
SNPs are invaluable resources for functional genomics 
and molecular breeding application. This set of pro- 
cessed EST sequences provides a seed for future inves- 
tigation of wheat functional genomics using both long 
and short oligonucleotide arrays. Our data will thus 
act as a backbone for wheat genome sequence 
assembly, which is progressing rapidly. 
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